The chromosome-level genome assembly of the giant dobsonfly Acanthacorydalis orientalis (McLachlan, 1899)

Acanthacorydalis orientalis (McLachlan, 1899) (Megaloptera: Corydalidae) is an important freshwater-benthic invertebrate species that serves as an indicator for water-quality biomonitoring and is valuable for conservation from East Asia. Here, a high-quality reference genome for A. orientalis was constructed using Oxford Nanopore sequencing and High throughput Chromosome Conformation Capture (Hi-C) technology. The final genome size is 547.98 Mb, with the N50 values of contig and scaffold being 7.77 Mb and 50.53 Mb, respectively. The longest contig and scaffold are 20.57 Mb and 62.26 Mb in length, respectively. There are 99.75% contigs anchored onto 13 pseudo-chromosomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the completeness of the genome assembly is 99.01%. There are 10,977 protein-coding genes identified, of which 84.00% are functionally annotated. The genome contains 44.86% repeat sequences. This high-quality genome provides substantial data for future studies on population genetics, aquatic adaptation, and evolution of Megaloptera and other related insect groups.

molecular convergences in aquatic insects during convergent amino acid substitutions and gene family evolution was also provided.For the aquatic chironomid larvae which can tolerant water pollution, expansion of the gene family related to detoxification metabolism has been found with adaptation to such hazard environment.Conversely, as the corydalid larvae are sensitive to the deterioration of freshwater habitat, uncovering the gene family related to detoxification metabolism of corydalid species is crucial for understanding the specific preference for clean freshwater habitats of Megaloptera.
Here, we assembled a high-quality chromosome-level genome of A. orientalis by using Oxford Nanopore sequencing and High throughput Chromosome Conformation Capture (Hi-C) technology.The final genome size is 547.98 Mb, with the N50 value of contig being 7.77 Mb.Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the completeness of the genome assembly is 99.01%.There are 10,977 protein-coding genes identified.BUSCO analysis showed that the completeness of the genome annotation is 95.76%.The assembly and annotation of this genome demonstrated a high degree of continuity and integrity.The genome herein reported will shed light on the aquatic adaptation of the megalopteran larvae as well as other benthic invertebrates and the genetics and evolution of this archaic holometabolan group.

Methods
Sampling and sequencing.The larva of A. orientalis used in this study was collected in Juma River, Shisandu, Fangshan District, Beijing, China on March 28, 2021.The genomic DNA was extracted by the G2 method using QIAGEN ® Genomic Kit (Cat#13343, QIAGEN).The DNA was purified after extraction due to impurities.
For Nanopore sequencing, 1D library was constructed with LSK109 kit on a PromethION sequencer.There are 153.75Gb data processed (coverage: 280.57X) totally.For Circular Consensus Sequencing (CCS), SMRTbell library was constructed with SMRTbellTM Express Template Prep Kit 2.0.Long DNA fragments of the SMRTbell library were sequenced on a PacBio Sequel II sequencer and the insert fragment size was 15 kb.The HiFi (High fidelity) reads were obtained by using CCS version 4.2.0 (--min-passes 1 --min-rq 0.99 --min-length 100) to process the offline data of PacBio Sequel II sequencer.One SMRT cell was processed resulting 416.42 Gb of subreads (coverage: 759.91 X).There are 25.36 Gb of HiFi reads (mean length: 14.66 kb, N50 length: 14.68 kb) obtained from the PacBio Sequel II platform for genome assembly polishing after calling CCS.For next-generation sequencing (NGS), the library was constructed by using the MGIEasy DNA kit, with an insert fragment size of 300-500 bp.The DNA library was sequenced using 150 bp paired-end (PE) reads on the MGI-2000 platform according to the protocol.There are 176.05Gb (coverage: 319.44 X) raw data generated by the MGI 2000 platform.
The Hi-C technology was used to assist the genome assembly at the chromosome-level, which has been applied to capture whole genome chromatin interactions 18 .The Hi-C fragment library was constructed and sequenced using the Illumina Nova-Seq 6000 platform according to the previously published protocol 19 .Hi-C analysis was performed using tissues from the same larva.The Hi-C library was sequenced with PE reads of 150 bp (insert fragment size was 300-500 bp).There are 53.04Gb of Hi-C raw data (176,791,896 PE reads) generated totally.
For RNA-seq, the RNA library was constructed with Qiagen Kit using 500 ng RNA with 12.02 Gb raw data (coverage: 21.93 X) obtained from the MGI-2000 platform.For PacBio Iso-seq sequencing, 500 ng RNA was reverse transcribed into cDNA and amplified using Iso-seq Express Oligo Kit with cDNAs purified using ProNex Beads, and a library was constructed using BluePippin with an insert fragment size of 0.5-6 kb.There are 55 Gb raw data (coverage: 100.36 X) obtained.
Genome size estimation and assembly.The raw data of NGS reads which sequenced on the MGI-2000 platform were filtered by using fastp version 0.21.0 20 (n 0-f 5-f 5-t 5-t 5-q 20) preprocessor to remove low-quality reads and obtain clean data.The quality of clean data was controlled using FastQC version 0.11.8 21(--extract).Then a part of the clean data after quality control was used for genome survey and all clean data was used to correct the genome assembly.
The genome survey analysis was conducted to infer the genomic characteristics of A. orientalis and develop a reasonable assembly plan before assembling the genome.There are 112.46Gb MGI DNA raw data used for k-mer analysis to estimate the genome size and heterozygosity.The frequency distribution analysis on quality filtered reads was performed using KMC version 3.1.1 22(-k35 ci1-cs1000000) with 35-mer.The short segment data at corresponding depths was simulated using the genome of Arabidopsis.The heterozygosity of A. orientalis was estimated by performing k-mer curve fitting under different gradient combinations of heterozygosity.The genome size is approximately 431.01 Mb and the heterozygosity is 1.2% based on the frequency distribution analysis of 35-mer (Supplementary Table 1).
On the Nanopore sequencing platform, the process of converting potential signals generated by DNA or RNA strands passing through nanopores into corresponding base sequences is called base-calling.The fastq format of raw reads with mean_qscore <7 23 was filtered using the official tool Guppy version 3.2.2+ 9fe0a78 24 (-c dna_r9.4.1_450bps_fast.cfg)to obtain the pass reads.Then the pass reads can be directly used for subsequent assembly.The genome was assembled using the NextGraph (default parameter) module in NextDenovo version 2.3.1 (default parameters, reads_cutoff as 1k,seed_cutoff as 50k) (https://github.com/Nextomics/NextDenovo)after correcting and trimming the raw data using NextCorrect (default parameter) module.A genome draft of 567.31 Mb was generated through denovo assembly.The genome preliminary assembly was corrected using Nextpolish version 1.3.0 25 with default parameter.The Nanopore third-generation data were corrected three times and the PacBio HiFi reads were corrected three times, while NGS data were corrected four times.The polished genome size is 551.29 Mb and contig N50 is 7.82 Mb after decontamination.
High-quality reads were obtained by filtering the original off-line data of the sample.The reads mapped uniquely to the genome at both ends of PE for subsequent analysis were extracted after removing the duplicate reads.The effective interaction and the proportion of sequences with self-cycle and biotin at the end was predicted using the position information of the DpnII site in the genome sketch.Finally, the genome assembly was assisted by analyzing the interaction between sequences.The data obtained by sequencing was the raw off-line sequence containing sequencing connector sequence and low-quality sequence.Fastp version 0.21.0 20 (-n 0-f 5-f 5-t 5-t 5-q 20) was used to filter the original sequence to ensure the quality of the analytic data and obtain high-quality clean reads.Then the duplicate reads were removed before the subsequent analysis.There are 53.04Gb of Hi-C raw data (176,791,896 PE reads) generated after filtering and 52.46 Gb (coverage: 95.73 X) of Hi-C filtered data (350,924,920 clean PE reads) used to assist the chromosome-level assembly.
Single-ended alignment of the sequenced Reads1 and Reads2 with the assembled genome sequence was performed to obtain localization information using Bowtie2 version 2.3.2 26 (alignment mode: -end to end; parameter: --very sensitive -L 30) due to the unique nature of the construction of the Hi-C library.Then the linked sites (enzyme cleavage point reconnection) in the unmapped PE reads after comparison were found by interception and comparison again.Finally, the PE reads of two comparisons were merged and the proportion of uniquely mapped PE reads was calculated.The unique read pairs around the DpnII cleavage site for comparison were determined by comparing and analyzing them.Hi-C interaction signals were used as a measure of the degree of correlation between different contigs by standardizing the DpnII cleavage sites on the genome sketch.For the genome sketch with a karyotype of 2n, using LACHESIS 27 (https://github.com/shendurelab/LACHESIS) software, the contig sequence of the sketch was clustered into 13 pseudo-chromosomal groups using an agglomerative hierarchical clustering algorithm.Contig sorting was performed within the cluster group of each pseudo-chromosome.Finally, the final chromosome-level genome sequence was obtained by adding 100 N connections after sequence and direction of the contig determined.There are 100,691,193 unique PE reads retained, including 65,757,980 effective interaction PE reads after mapping them onto the genome draft.There are 170 contigs attached to 13 pseudo-chromosomes (Fig. 2a).The length of 13 pseudo-chromosomes ranged from 9.51 Mb to 62.25 Mb (Fig. 2b), respectively, and the total length of 13 pseudo-chromosomes is 546,614,392 bp, accounting for 99.75% of the genome size (Table 1).The size of the chromosome-level genome obtained ultimately is 547.98 Mb, with 184 contigs and 27 scaffolds.The longest contig and scaffold is 20.56 Mb and 62.25 Mb, respectively, and the N50 length of the contig and scaffold is 7.77 Mb and 50.53 Mb, respectively.
Contigs were sorted within the cluster group of each pseudo-chromosome.In the final chromosome-level genome obtained, 100 kb was taken as a bin.The number of Hi-C read pairs covering any two bins was used as the intensity signal for the interaction between two bins.In the Hi-C interaction heatmap of pseudo-chromosomes, the heatmap coordinates represented pseudo-chromosomes, and the color of each point represented the log value of the corresponding genome bin pair interaction intensity, which increased sequentially from yellow to black.
Gene structure and functional annotation.SSR sequences in the genome were analyzed using GMATA version 2.2 28 with default parameters since it could identify microsatellite sites in sequences.Tandem repeats (TR) in the genome were analyzed using TRF version 4.07b 29 with default parameters.The Trans-posable elements (TE) of this species were predicted using RepeatMask version 1.331 30 (http://www.repeatmasker.org) based on the final-constructed repeat sequence database.There are 120,522 TRs with a total length of 4,359,538 bp (Supplementary Table 2), accounting for 0.80% of the whole genome.The total length of 2,401,248 TEs is  1), the GC content of each pseudo-chromosome (2), protein coding genes (3), the density of repeat sequences (DNA elements (4); SINE, short interspersed elements (5); LINE, long interspersed elements (6); LTR, long terminal repeat elements (7); simple repeats ( 8)), respectively.
For RNA-seq-based gene prediction, the clean data was compared to the reference genome using STAR version 2.7.3a 41 after data quality control.Then the transcripts of the clean data were assembled using StringTie version 1.3.4d 42with default parameters based on the results of the comparison genome and 6,702 second-generation transcript sequences were obtained for subsequent analysis.For ISO-seq-based gene prediction, high-quality reads of insert were obtained after the connector sequence was removed from PacBio offline data and the same polymerase read was self-corrected.The full-length transcript was identified and the primer sequence was removed using Lima version 2.2.0 (https://lima.how/) with default parameters.The full-length non-chimeric reads with the end poly-A sequence removed (full-length non-concatemer reads) were obtained by calling the Refine process in Isoseq 3 (https://github.com/ylipacbio/IsoSeq3).A cluster was performed on the full-length non-chimeric reads obtained after the refining.The high-quality consensus reads were mapped to the reference genome by using Minimap2 version 1.0 43 , and the sequences of multiple gene loci on the alignment were removed from the alignment.Then the subprogram in cDNA_Cupcake (https://github.com/Magdoll/cDNA_Cupcake/wiki) was used to filter and further remove the redundancy, and the alignment results before and after redundancy removal were counted.A total of 7,405 third-generation transcripts were obtained for downstream gene prediction and analysis.There are a total of 7,949 genes predicted (Table 3) using PASA version 2.3.3 44 based on the second and third-generation transcripts obtained from the above analysis, and corresponding training models were obtained for ab initio prediction.
The species prediction model was obtained by selecting reliable genes for model training through AUGUSTUS version 3.3.1 45 based on transcriptome prediction.There are 11,167 genes finally predicted (Table 3) using AUGUSTUS version 3.3.1 45 for ab initio prediction of gene structure based on this training model.
The gene set of the initial genome of A. orientalis was obtained by integrating GeMoMa 31 gene prediction results, PASA 44 gene prediction results and ab initio gene prediction results, using Evidence Modeler (EVM) version 1.1.1 46(--segmentSize 1000000 --overlapSize 100000) with a certain weight value (EVM weights: PASA 10, GeMoMa 5, AUGUSTUS 1).The final gene set was obtained by removing genes that contain TEs and encoding errors from the initial genome gene set through TransposoonPSI (http://transposonpsi.sourceforge.net/).A total of 10,977 genes (Table 3) were predicted, with an average gene length of 25,723.76bp and an average Coding DNA Sequence (CDS) length of 1,644.25 bp, while the average exons number per gene is 6.28 bp, the mean exon length is 261.7 bp and the mean intron length is 4,557.88bp (Table 3), respectively.Genomic ncRNA was predicted using Infernal version 1.1.2 47with default parameters compared with the Rfam database 48 , while tRNA was predicted using tRNAscan-SE version 2.0 49 (--thread 4 -E -I) and rRNA and its various subunits were predicted using RNAmmer version 1.2 50 (-S euk -m lsu,ssu,tsu -gff) to construct the model based on the assembled genome sequence.The above results were further integrated to obtain the predicted ncRNA in the genome.There are 1,153 ncRNA sequences annotated in total, including 156 rRNAs, 190 small RNAs, 15 cis-regulatory elements, and 792 tRNAs (Supplementary Table 3).

Data records
The raw Nanopore, PacBio, Hi-C, MGI, RNA-seq, and Iso-seq data were submitted to the Sequence Read Archive at NCBI under accession numbers SRP464006 56 .
The genome assembly data had been submitted to GenBank with accession number GCA_034766995.1 57 .The genome annotation GFF, CDS sequences, and protein sequences are available in Figshare 58 .
technical Validation assessment of the genome assembly and annotation.BUSCO was used to evaluate genome completeness at the chromosome-level according to the arthropoda_odb10 database.There are 1,003 (99.01%) complete genes, including 999 (98.62%) single-copy genes, 4 (0.39%) duplicated genes, 3 (0.30%) fragmented genes, and 7 (0.69%) missing genes identified (Table 5), indicating that the majority of conserved genes were assembled relatively complete and accurate.BUSCO was also used to evaluate the predicted gene set and about 970 (95.76%) (Table 5) of complete gene elements in the annotated gene set with 967 (95.46%) single-copy genes, 3 (0.30%) duplicated genes, 7 (0.69%) fragmented genes and 36 (3.55%) missing genes, indicating that the majority of conservative gene predictions were relatively complete and the prediction results are highly reliable.

Fig. 2
Fig. 2 Hi-C contact map and overview of the genomic landscape of Acanthacorydalis orientalis.(a) The heatmap shows the strength of interval interactions within pseudo-chromosomes.Resolution: 100 kb.The frequency of Hi-C interactive links is represented by color, ranging from yellow (low) to black (high).(b) Distribution of genomic features of A. orientalis.Blocks on the outmost circle represent all 13 pseudochromosomes of A. orientalis.Peak plots from outer to inner circles represent the length of each pseudochromosome (1), the GC content of each pseudo-chromosome (2), protein coding genes (3), the density of repeat sequences (DNA elements (4); SINE, short interspersed elements (5); LINE, long interspersed elements (6); LTR, long terminal repeat elements (7); simple repeats (8)), respectively.

Table 1 .
Pseudo-chromosomes length in the genome of Acanthacorydalis orientalis.

Table 2 .
Information of 11 species used in this study.

Table 3 .
Gene prediction results based on three strategies.

Table 4 .
Statistics for functional annotation of protein-coding genes.

Table 5 .
BUSCO evaluation statistics for genome assembly and annotation of Acanthacorydalis orientalis.